SIMULTANEOUS ANALYSIS OF LASSO AND DANTZIG SELECTOR*

By Peter J. Bickel, Ya'acov Ritov and Alexandre B. Tsybakov

We exhibit an approximate equivalence between the Lasso estimator and Dantzig selector. For both methods we derive parallel oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the $\ell_p$ estimation loss for $1 \le p \le 2$ in the linear model when the number of variables can be much larger than the sample size.
1. Introduction. During the last few years a great deal of attention has been focused on the $\ell_1$ penalized least squares (Lasso) estimator of parameters in high-dimensional linear regression when the number of variables can be much larger than the sample size [8-10, 15, 16, 18-20, 24, 25]. Quite recently, Candes and Tao [7] have proposed a new estimate for such linear models, the Dantzig selector, for which they establish optimal $\ell_2$ rate properties under a sparsity scenario, i.e., when the number of non-zero components of the true vector of parameters is small.

Lasso estimators have also been studied in the nonparametric regression setup [2-5, 11, 12, 17]. In particular, Bunea et al. [2-5] obtain sparsity oracle inequalities for the prediction loss in this context and point out the implications for minimax estimation in classical nonparametric regression settings, as well as for the problem of aggregation of estimators. An analog of Lasso for density estimation with similar properties (SPADES) is proposed in [6]. Modified versions of Lasso estimators (non-quadratic terms and/or penalties slightly different from $\ell_1$) for nonparametric regression with random design are suggested and studied under prediction loss in [13, 23]. Sparsity oracle inequalities for the Dantzig selector with random design are obtained in [14]. In linear fixed design regression, Meinshausen and Yu [16] establish a bound on the $\ell_2$ loss for the coefficients of Lasso which is quite different from the bound on the same loss for the Dantzig selector proven in [7].






*Partially supported by NSF grant DMS-0605236, ISF Grant, and France-Berkeley 
Fund. 

AMS 2000 subject classifications: Primary 60K35, 62G08; secondary 62C20, 62G05, 
62G20 

Keywords and phrases: Linear models, Model selection, Nonparametric statistics


imsart-aos ver. 2007/02/20 file: BRT_LassoDanPostSumission.tex date: February 2, 2008 






The main message of this paper is that under a sparsity scenario, the Lasso and the Dantzig selector exhibit similar behavior, both for linear regression and for nonparametric regression models, for $\ell_2$ prediction loss and for $\ell_p$ loss in the coefficients for $1 \le p \le 2$. All the results of the paper are non-asymptotic.

Let us specialize to the case of linear regression with many covariates, $y = X\beta + W$, where $X$ is the $n \times M$ deterministic design matrix, with $M$ possibly much larger than $n$, and $W$ is a vector of i.i.d. standard normal random variables. This is the situation considered most recently by Candes and Tao [7] and Meinshausen and Yu [16]. Here sparsity specifies that the high-dimensional vector $\beta$ has coefficients that are mostly 0. Our key observation is that the deviations from the true regression function of the Dantzig selector and of the Lasso estimate lie, with high probability, in a region such that the contribution to their $\ell_1$ loss from coordinates of $\beta$ which vanish is of the same order as the contribution from those which do not.

We develop general tools to study these two estimators in parallel. For the fixed design Gaussian regression model we recover, as particular cases, sparsity oracle inequalities for the Lasso, as in Bunea et al. [4], and $\ell_2$ bounds for the coefficients of Dantzig selector, as in Candes and Tao [7]. This is obtained as a consequence of more general results, which include:

• Sparsity oracle inequalities for the Dantzig selector in the nonparametric regression model under $\ell_2$ prediction loss.

• Sparsity oracle inequalities for the Lasso in the nonparametric regression model under more general assumptions on the design matrix than in [4].

• An approximate equivalence between Lasso and Dantzig selector in nonparametric regression.

• We develop geometrical assumptions which are considerably weaker than those of Candes and Tao [7] for the Dantzig selector and Bunea et al. [4] for the Lasso. In the context of linear regression where the number of variables is possibly much larger than the sample size these assumptions imply the result of [7] for the $\ell_2$ loss and generalize it to $\ell_p$ loss, $1 \le p \le 2$, and to prediction loss. Our bounds for the Lasso differ from those for Dantzig selector only in numerical constants.

We begin, in the next section, by defining the Lasso and Dantzig procedures and the notation. We then give some basic properties of the two procedures, introducing notation and two important technical lemmas. In Section 3 we develop our key geometric assumptions, and compare them to those of [7] and [16] as well as to ones appearing in [4] and [5]. We note a weakness of our assumptions, and hence also those of the authors we cited, and show also a way of remedying them. Sections 4 and 5 give the equivalence results and sparsity oracle inequalities for the Lasso and Dantzig estimators in the general nonparametric regression model. Section 6 focuses on linear regression and includes a final discussion.



2. Basic properties of Lasso and Dantzig solutions. Let $(Z_1, Y_1), \dots, (Z_n, Y_n)$ be a sample of independent random pairs with

(2.1)  $Y_i = f(Z_i) + W_i, \quad i = 1, \dots, n,$

where $f : \mathcal{Z} \to \mathbb{R}$ is an unknown regression function to be estimated, $\mathcal{Z}$ is a Borel subset of $\mathbb{R}^d$, the $Z_i$'s are fixed elements in $\mathcal{Z}$ and the regression errors $W_i$ are Gaussian. Let $\mathcal{F}_M = \{f_1, \dots, f_M\}$ be a finite dictionary of functions $f_j : \mathcal{Z} \to \mathbb{R}$, $j = 1, \dots, M$. We assume throughout that $M \ge 2$.

Depending on the statistical targets, the dictionary $\mathcal{F}_M$ can be of different nature. For instance, it can be a collection of basis functions used to approximate $f$ in the nonparametric regression model. Another example is related to the aggregation problem where the $f_j$ are estimators arising from $M$ different methods. They can also correspond to $M$ different values of the tuning parameter of the same method. Without much loss of generality, these estimators $f_j$ are treated as fixed functions: the results are viewed as being conditioned on the sample the $f_j$ are based on.

For any $\lambda = (\lambda_1, \dots, \lambda_M) \in \mathbb{R}^M$, define $f_\lambda(z) = \sum_{j=1}^M \lambda_j f_j(z)$. The estimates we consider are all of the form $f_{\tilde\lambda}(\cdot)$ where $\tilde\lambda$ is data determined.

Let

$M(\lambda) = \sum_{j=1}^M I_{\{\lambda_j \ne 0\}} = |J(\lambda)|$

denote the number of non-zero coordinates of $\lambda$, where $I_{\{\cdot\}}$ denotes the indicator function, $J(\lambda) = \{j \in \{1, \dots, M\} : \lambda_j \ne 0\}$, and $|J|$ denotes the cardinality of $J$. The value $M(\lambda)$ characterizes the sparsity of the vector $\lambda$: the smaller $M(\lambda)$, the "sparser" $\lambda$.
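Introduce the residual sum of squares

$\hat S(\lambda) = \frac{1}{n} \sum_{i=1}^n \{Y_i - f_\lambda(Z_i)\}^2$

for all $\lambda \in \mathbb{R}^M$. Define the Lasso solution $\hat\lambda_L = (\hat\lambda_{1,L}, \dots, \hat\lambda_{M,L})$ by

(2.2)  $\hat\lambda_L = \arg\min_{\lambda \in \mathbb{R}^M} \Big\{ \hat S(\lambda) + 2r \sum_{j=1}^M \|f_j\|_n |\lambda_j| \Big\},$

where $r > 0$ is some tuning constant, and introduce the corresponding Lasso estimator

(2.3)  $\hat f_L(z) = f_{\hat\lambda_L}(z) = \sum_{j=1}^M \hat\lambda_{j,L} f_j(z).$

Here and below $\|\cdot\|_n$ stands for the empirical norm:

$\|g\|_n = \sqrt{\frac{1}{n} \sum_{i=1}^n g^2(Z_i)}$

for any $g : \mathcal{Z} \to \mathbb{R}$.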

The criterion in (2.2) is convex in $\lambda$, so that standard convex optimization procedures can be used to compute $\hat\lambda_L$. We refer to [9, 18, 19, 22] for detailed discussion of these optimization problems and fast algorithms.

For a vector $\lambda \in \mathbb{R}^M$ and a subset $J \subset \{1, \dots, M\}$ we denote by $\lambda_J$ the vector in $\mathbb{R}^M$ which has the same coordinates as $\lambda$ on $J$ and zero coordinates on the complement $J^c$ of $J$.

We also introduce the matrix $X = (f_j(Z_i))_{i,j}$, $i = 1, \dots, n$, $j = 1, \dots, M$, and the vectors $y = (Y_1, \dots, Y_n)^T$, $\mathbf{f} = (f(Z_1), \dots, f(Z_n))^T$, $W = (W_1, \dots, W_n)^T$. We will write $|x|_p$ for the $\ell_p$ norm of $x \in \mathbb{R}^M$, $1 \le p \le \infty$.

With this notation,

$y = \mathbf{f} + W.$

The Dantzig estimator of the regression function $f$ is defined by

(2.4)  $\hat f_D(z) = f_{\hat\lambda_D}(z) = \sum_{j=1}^M \hat\lambda_{j,D} f_j(z),$

where $\hat\lambda_D = (\hat\lambda_{1,D}, \dots, \hat\lambda_{M,D})$ is the Dantzig selector, i.e., a solution of the minimization problem

(2.5)  $\hat\lambda_D = \arg\min \Big\{ |\lambda|_1 : \Big| \frac{1}{n} D^{-1/2} X^T (y - X\lambda) \Big|_\infty \le r \Big\}$

with some $r > 0$ and the diagonal matrix

$D = \mathrm{diag}\{\|f_1\|_n^2, \dots, \|f_M\|_n^2\}.$

Here and below we suppose that $\|f_j\|_n \ne 0$, $j = 1, \dots, M$. Set

$f_{\max} = \max_{1 \le j \le M} \|f_j\|_n, \qquad f_{\min} = \min_{1 \le j \le M} \|f_j\|_n.$

The Dantzig selector is computationally feasible, since it reduces to a linear programming problem [7].



It is easy to see that the Lasso solution obeys the Dantzig constraint. In fact, the necessary and sufficient condition for the minimum in (2.2) is that 0 belongs to the subdifferential of the convex function $\lambda \mapsto n^{-1} |y - X\lambda|_2^2 + 2r |D^{1/2} \lambda|_1$. This implies that the Lasso selector $\hat\lambda_L$ satisfies the Dantzig constraint:

(2.6)  $\Big| \frac{1}{n} D^{-1/2} X^T (y - X\hat\lambda_L) \Big|_\infty \le r.$

Therefore, by the definition of Dantzig selector, we have $|\hat\lambda_D|_1 \le |\hat\lambda_L|_1$.
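The inclusion (2.6) can be checked numerically. The following is a minimal sketch and not the authors' code: it assumes a small simulated Gaussian design and solves (2.2) by plain cyclic coordinate descent, a textbook method swapped in here only for brevity (the algorithms of [9, 18, 19, 22] are not reproduced). The function names `lasso_cd` and `dantzig_lhs` and all numerical values are illustrative assumptions.

```python
import math
import random


def soft_threshold(c, t):
    """Soft-thresholding operator: sign(c) * max(|c| - t, 0)."""
    if c > t:
        return c - t
    if c < -t:
        return c + t
    return 0.0


def column_norms_sq(X):
    """Squared empirical norms ||f_j||_n^2 = (1/n) * sum_i X[i][j]^2."""
    n, M = len(X), len(X[0])
    return [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(M)]


def lasso_cd(X, y, r, max_sweeps=2000, tol=1e-12):
    """Minimize (1/n)|y - X lam|_2^2 + 2 r sum_j ||f_j||_n |lam_j|, cf. (2.2)."""
    n, M = len(X), len(X[0])
    d = column_norms_sq(X)
    lam = [0.0] * M
    resid = list(y)  # current residual y - X lam
    for _ in range(max_sweeps):
        biggest_move = 0.0
        for j in range(M):
            # (1/n) X_j^T applied to the partial residual that ignores coordinate j
            c = sum(X[i][j] * resid[i] for i in range(n)) / n + d[j] * lam[j]
            new = soft_threshold(c, r * math.sqrt(d[j])) / d[j]
            move = new - lam[j]
            if move != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * move
                lam[j] = new
                biggest_move = max(biggest_move, abs(move))
        if biggest_move < tol:
            break
    return lam


def dantzig_lhs(X, y, lam):
    """Left-hand side of (2.6): |(1/n) D^{-1/2} X^T (y - X lam)|_inf."""
    n, M = len(X), len(X[0])
    d = column_norms_sq(X)
    resid = [y[i] - sum(X[i][j] * lam[j] for j in range(M)) for i in range(n)]
    return max(
        abs(sum(X[i][j] * resid[i] for i in range(n)) / n) / math.sqrt(d[j])
        for j in range(M)
    )


# Simulated example (illustrative values): n = 20, M = 40, sparse true vector.
random.seed(0)
n, M, sigma, A = 20, 40, 1.0, 3.0
X = [[random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(n)]
beta = [5.0, -4.0, 3.0] + [0.0] * (M - 3)
y = [sum(X[i][j] * beta[j] for j in range(M)) + random.gauss(0.0, sigma)
     for i in range(n)]
r = A * sigma * math.sqrt(math.log(M) / n)

lam_lasso = lasso_cd(X, y, r)
lhs = dantzig_lhs(X, y, lam_lasso)
print("left side of (2.6):", lhs, " threshold r:", r)
```

Up to the numerical tolerance of the solver, the printed left-hand side should not exceed $r$, which is exactly the statement that the Lasso solution is feasible for the Dantzig problem (2.5).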

We conclude this section with two lemmata, whose proofs are given in the 
appendix. 



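Lemma 1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$ and let $\hat f_L$ be the Lasso estimator defined by (2.3) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

for some $A > 2\sqrt{2}$. Then for all $M \ge 2$, $n \ge 1$, with probability of at least $1 - M^{1 - A^2/8}$ we have simultaneously for all $\lambda \in \mathbb{R}^M$:

(2.7)  $\|\hat f_L - f\|_n^2 + r \sum_{j=1}^M \|f_j\|_n |\hat\lambda_{j,L} - \lambda_j|$
       $\le \|f_\lambda - f\|_n^2 + 4r \sum_{j \in J(\lambda)} \|f_j\|_n |\hat\lambda_{j,L} - \lambda_j|$
       $\le \|f_\lambda - f\|_n^2 + 4r \sqrt{M(\lambda)} \sqrt{\sum_{j \in J(\lambda)} \|f_j\|_n^2 |\hat\lambda_{j,L} - \lambda_j|^2},$

and

(2.8)  $\Big| \frac{1}{n} X^T (\mathbf{f} - X\hat\lambda_L) \Big|_\infty \le 3 r f_{\max} / 2.$

Furthermore, with the same probability

(2.9)  $M(\hat\lambda_L) \le 4 \phi_{\max} f_{\min}^{-2} \big( \|\hat f_L - f\|_n^2 / r^2 \big)$

where $\phi_{\max}$ denotes the maximal eigenvalue of the matrix $X^T X/n$.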



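Lemma 2. Let $\lambda \in \mathbb{R}^M$ satisfy the Dantzig constraint

(2.10)  $\Big| \frac{1}{n} D^{-1/2} X^T (y - X\lambda) \Big|_\infty \le r$

and set $\Delta = \hat\lambda_D - \lambda$, $J_0 = J(\lambda)$. Then

(2.11)  $|\Delta_{J_0^c}|_1 \le |\Delta_{J_0}|_1.$

Further, let the assumptions of Lemma 1 be satisfied with $A > \sqrt{2}$. Then for all $M \ge 2$, $n \ge 1$ with probability of at least $1 - M^{1 - A^2/2}$ we have

(2.12)  $\Big| \frac{1}{n} X^T (\mathbf{f} - X\hat\lambda_D) \Big|_\infty \le 2 r f_{\max}.$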



3. Restricted eigenvalue assumptions. For any $n \ge 1$, $M \ge 2$, consider the Gram matrix

$\Psi_n = \frac{1}{n} X^T X = \Big( \frac{1}{n} \sum_{i=1}^n f_j(Z_i) f_{j'}(Z_i) \Big)_{1 \le j, j' \le M}.$

We now introduce the key assumptions on the Gram matrix that are needed to guarantee nice statistical properties of Lasso and Dantzig selector. Under the sparsity scenario we are typically interested in the case where $M \ge n$, and even $M \gg n$. Then the matrix $\Psi_n$ is degenerate, which can be written as

$\min_{\Delta \in \mathbb{R}^M : \Delta \ne 0} \frac{(\Delta^T \Psi_n \Delta)^{1/2}}{|\Delta|_2} \equiv \min_{\Delta \in \mathbb{R}^M : \Delta \ne 0} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta|_2} = 0.$

Clearly, ordinary least squares does not work in this case, since it requires positive definiteness of $\Psi_n$, i.e.,

(3.1)  $\min_{\Delta \in \mathbb{R}^M : \Delta \ne 0} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta|_2} > 0.$

It turns out that the Lasso and Dantzig selector require much weaker assumptions: the minimum in (3.1) can be replaced by the minimum over a restricted set of vectors, and the norm $|\Delta|_2$ in the denominator of the condition can be replaced by the $\ell_2$ norm of only a part of $\Delta$. The resulting conditions will be referred to as restricted eigenvalue (RE) assumptions.

Our first RE assumption is stated as follows, where $s$ is an integer such that $1 \le s \le M$, and $c_0$ is a positive number:

Assumption RE$(s, c_0)$:

$\kappa(s, c_0) = \min_{J_0 \subseteq \{1, \dots, M\} : |J_0| \le s} \ \min_{\Delta \ne 0 : |\Delta_{J_0^c}|_1 \le c_0 |\Delta_{J_0}|_1} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_{J_0}|_2} > 0.$

The integer $s$ here plays the role of an upper bound on the sparsity $M(\lambda)$ of a vector of coefficients $\lambda$. We will usually interpret $J_0$ as the set of non-zero coefficients of $\lambda$. To explain the role of the constant $c_0$, we may note that the vector of Dantzig residuals $\Delta$ satisfies $|\Delta_{J_0^c}|_1 \le c_0 |\Delta_{J_0}|_1$ with $c_0 = 1$, cf. (2.11). A similar inequality holds for the vector of Lasso residuals $\Delta = \hat\lambda_L - \lambda$, but this time with $c_0 = 3$, and with probability at least $1 - M^{1 - A^2/8}$, in the particular case of Lemma 1 where $\lambda$ is such that $\|f_\lambda - f\|_n = 0$ and $\|f_j\|_n = 1$ (cf. (2.7)).
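The value $c_0 = 3$ for the Lasso can be traced in one step from the first inequality in (2.7). The following worked step is a sketch under the assumptions stated in the text ($f_\lambda = f$ and $\|f_j\|_n = 1$ for all $j$), writing $\Delta$ for the Lasso residual vector and $J_0 = J(\lambda)$:

```latex
% With \|f_\lambda - f\|_n = 0 and \|f_j\|_n = 1, the first line of (2.7) reads
\|\hat f_L - f\|_n^2 + r\,|\Delta|_1 \;\le\; 4r\,|\Delta_{J_0}|_1 .
% Drop the nonnegative first term and split
% |\Delta|_1 = |\Delta_{J_0}|_1 + |\Delta_{J_0^c}|_1 :
r\,|\Delta_{J_0}|_1 + r\,|\Delta_{J_0^c}|_1 \;\le\; 4r\,|\Delta_{J_0}|_1
\quad\Longrightarrow\quad
|\Delta_{J_0^c}|_1 \;\le\; 3\,|\Delta_{J_0}|_1 .
```

The same splitting applied to (2.11) gives the constant 1 for the Dantzig selector, which is why the two estimators are analyzed with the same assumption and different values of $c_0$.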

To introduce the second assumption we need some more notation. For integers $s, m$ such that $1 \le s \le M/2$ and $m \ge s$, $s + m \le M$, for a vector $\Delta \in \mathbb{R}^M$ and a set of indices $J_0 \subseteq \{1, \dots, M\}$ with $|J_0| \le s$, denote by $J_1$ the subset of $\{1, \dots, M\}$ corresponding to the $m$ largest in absolute value coordinates of $\Delta$ outside of $J_0$ and define $J_{01} = J_0 \cup J_1$.

Assumption RE$(s, m, c_0)$:

$\kappa(s, m, c_0) = \min_{J_0 \subseteq \{1, \dots, M\} : |J_0| \le s} \ \min_{\Delta \ne 0 : |\Delta_{J_0^c}|_1 \le c_0 |\Delta_{J_0}|_1} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_{J_{01}}|_2} > 0.$

Note that Assumption RE$(s, c_0)$ is less restrictive than RE$(s, m, c_0)$. For our bounds on the prediction loss and on the $\ell_1$ loss of the Lasso and Dantzig estimators we will only need Assumption RE$(s, c_0)$. The stronger Assumption RE$(s, m, c_0)$ will be required exclusively for the bounds on the $\ell_p$ loss with $1 < p \le 2$.

Note also that Assumptions RE$(s', c_0)$ and RE$(s', m, c_0)$ imply Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ respectively if $s' > s$.
Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ are implied by several simple sufficient conditions. We now consider some of them.

For a real number $1 \le u \le M$ we introduce the following "restricted" eigenvalues:

$\phi_{\min}(u) = \min_{x \in \mathbb{R}^M : 1 \le M(x) \le u} \frac{|Xx|_2^2}{n |x|_2^2}, \qquad \phi_{\max}(u) = \max_{x \in \mathbb{R}^M : 1 \le M(x) \le u} \frac{|Xx|_2^2}{n |x|_2^2}.$

Denote by $X_J$ the $n \times |J|$ submatrix of $X$ obtained by removing from $X$ the columns that do not correspond to the indices in $J$, and for $1 \le m, m' \le M$ introduce the "restricted" correlations

$\theta_{m,m'} = \max \Big\{ \frac{c_J^T X_J^T X_{J'} c_{J'}}{n} : J \cap J' = \emptyset,\ |J| \le m,\ |J'| \le m',\ |c_J|_2 \le 1,\ |c_{J'}|_2 \le 1 \Big\},$

where $c_J \in \mathbb{R}^{|J|}$, $c_{J'} \in \mathbb{R}^{|J'|}$.

A sufficient condition for RE$(s, c_0)$ and RE$(s, m, c_0)$ with $m = s$ to hold is given, for example, by the following assumption on the Gram matrix.

Assumption 1. Assume

$\phi_{\min}(2s) > c_0 \theta_{s,2s}$

for some integer $1 \le s \le M/2$ and a constant $c_0 > 0$.

This condition with $c_0 = 1$ appeared in [7], in connection with the Dantzig selector. Assumption 1 is more general: we can have here an arbitrary constant $c_0 > 0$ which will allow us to cover not only the Dantzig selector but also the Lasso estimators, and to prove oracle inequalities for the prediction loss when the model is nonparametric.

Our second sufficient condition for RE$(s, c_0)$ and RE$(s, m, c_0)$ does not need bounds on correlations. Only bounds on the minimal and maximal eigenvalues of "small" submatrices of the Gram matrix are involved.

Assumption 2. Assume

$m\, \phi_{\min}(s + m) > c_0^2\, s\, \phi_{\max}(m)$

for some integers $s, m$ such that $1 \le s \le M/2$, $m \ge s$, and $s + m \le M$, and a constant $c_0 > 0$.

Assumption 2 can be viewed as a weakening of the condition on $\phi_{\min}$ in [16]. Indeed, taking $s + m = s \log n$ (we admit w.l.o.g. that $s \log n$ is an integer and $n > 3$) and assuming that $\phi_{\max}(\cdot)$ is uniformly bounded by a constant we get that Assumption 2 is equivalent to

(3.2)  $\phi_{\min}(s \log n) > c / \log n$

where $c > 0$ is a constant. The corresponding slightly stronger assumption in [16] is stated in asymptotic form (for $s = s_n \to \infty$):

$\liminf_n \phi_{\min}(s_n \log n) > 0.$



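The following two constants are useful when Assumptions 1 and 2 are considered:

$\kappa_1(s, c_0) = \sqrt{\phi_{\min}(2s)} \Big( 1 - \frac{c_0 \theta_{s,2s}}{\phi_{\min}(2s)} \Big)$

and

$\kappa_2(s, m, c_0) = \sqrt{\phi_{\min}(s + m)} \Big( 1 - c_0 \sqrt{\frac{s\, \phi_{\max}(m)}{m\, \phi_{\min}(s + m)}} \Big).$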



The next lemma shows that if Assumptions 1 or 2 are satisfied, then the quadratic form $x^T \Psi_n x$ is positive definite on some restricted sets of vectors $x$. The construction of the lemma is inspired by Candes and Tao [7] and covers, in particular, the corresponding result in [7].

Lemma 3. Fix an integer $1 \le s \le M/2$ and a constant $c_0 > 0$.

(i) Let Assumption 1 be satisfied. Then Assumptions RE$(s, c_0)$ and RE$(s, s, c_0)$ hold with $\kappa(s, c_0) = \kappa(s, s, c_0) = \kappa_1(s, c_0)$. Moreover, for any subset $J_0$ of $\{1, \dots, M\}$ with cardinality $|J_0| \le s$, and any $\Delta \in \mathbb{R}^M$ such that

(3.3)  $|\Delta_{J_0^c}|_1 \le c_0 |\Delta_{J_0}|_1$

we have

(3.4)  $\frac{1}{\sqrt{n}} |P_{01} X \Delta|_2 \ge \kappa_1(s, c_0) |\Delta_{J_{01}}|_2$

where $P_{01}$ is the projector in $\mathbb{R}^n$ on the linear span of the columns of $X_{J_{01}}$.

(ii) Let Assumption 2 be satisfied. Then Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ hold with $\kappa(s, c_0) = \kappa(s, m, c_0) = \kappa_2(s, m, c_0)$. Moreover, for any subset $J_0$ of $\{1, \dots, M\}$ with cardinality $|J_0| \le s$, and any $\Delta \in \mathbb{R}^M$ such that (3.3) holds we have

(3.5)  $\frac{1}{\sqrt{n}} |P_{01} X \Delta|_2 \ge \kappa_2(s, m, c_0) |\Delta_{J_{01}}|_2.$
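There exist other sufficient conditions for Assumptions RE$(s, c_0)$ and RE$(s, m, c_0)$ to hold. We mention here two of them implying Assumption RE$(s, c_0)$. The first one is the following [1].

Assumption 3. For an integer $s$ such that $1 \le s \le M$ we have

$\phi_{\min}(s) > 2 c_0 \theta_{s,1} \sqrt{s}$

where $c_0 > 0$ is a constant.

To argue that Assumption 3 implies RE$(s, c_0)$ it suffices to remark that

(3.6)  $\frac{1}{n} |X\Delta|_2^2 \ge \frac{1}{n} \Delta_{J_0}^T X_{J_0}^T X_{J_0} \Delta_{J_0} - \frac{2}{n} \big| \Delta_{J_0}^T X_{J_0}^T X_{J_0^c} \Delta_{J_0^c} \big|$
       $\ge \phi_{\min}(s) |\Delta_{J_0}|_2^2 - \frac{2}{n} \big| \Delta_{J_0}^T X_{J_0}^T X_{J_0^c} \Delta_{J_0^c} \big|$

and, if (3.3) holds,

$\frac{2}{n} \big| \Delta_{J_0}^T X_{J_0}^T X_{J_0^c} \Delta_{J_0^c} \big| \le 2 \theta_{s,1} |\Delta_{J_0^c}|_1 |\Delta_{J_0}|_2 \le 2 c_0 \theta_{s,1} \sqrt{s}\, |\Delta_{J_0}|_2^2.$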

Another type of assumption related to "mutual coherence" [n] is discussed 
in the connection to Lasso in [4, 5]. We state it here in a slightly different 
form. 

Assumption 4. For an integer $s$ such that $1 \le s \le M$ we have

(3.7)  $\phi_{\min}(s) > 2 c_0 \theta_{1,1} s$

where $c_0 > 0$ is a constant.

It is easy to see that Assumption 4 implies RE$(s, c_0)$. Indeed, if (3.3) holds,

(3.8)  $\frac{1}{n} |X\Delta|_2^2 \ge \frac{1}{n} \Delta_{J_0}^T X_{J_0}^T X_{J_0} \Delta_{J_0} - 2 \theta_{1,1} |\Delta_{J_0^c}|_1 |\Delta_{J_0}|_1$
       $\ge \phi_{\min}(s) |\Delta_{J_0}|_2^2 - 2 c_0 \theta_{1,1} |\Delta_{J_0}|_1^2$
       $\ge (\phi_{\min}(s) - 2 c_0 \theta_{1,1} s) |\Delta_{J_0}|_2^2.$

If all the diagonal elements of matrix $X^T X/n$ are equal to 1 (and thus $\theta_{1,1}$ coincides with the mutual coherence [n]), a simple sufficient condition for Assumption RE$(s, c_0)$ to hold is given by

(3.9)  $\theta_{1,1} < \frac{1}{(1 + 2 c_0) s}.$

In fact, separating the diagonal and off-diagonal terms of the quadratic form we get

$\frac{1}{n} \Delta_{J_0}^T X_{J_0}^T X_{J_0} \Delta_{J_0} \ge |\Delta_{J_0}|_2^2 - \theta_{1,1} |\Delta_{J_0}|_1^2 \ge |\Delta_{J_0}|_2^2 (1 - \theta_{1,1} s).$

Combining this inequality with (3.8) we see that Assumption RE$(s, c_0)$ is satisfied whenever (3.9) holds.
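Condition (3.9) is the easiest of these to check on a concrete design, since $\theta_{1,1}$ reduces to the largest off-diagonal entry (in absolute value) of $X^T X/n$. The following is a minimal sketch under that reading; the function names and the two toy designs are illustrative assumptions, and both designs have columns normalized so that the diagonal of $X^T X/n$ equals 1.

```python
def gram_matrix(X):
    """Gram matrix X^T X / n for a design given as a list of n rows."""
    n, M = len(X), len(X[0])
    return [
        [sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(M)]
        for j in range(M)
    ]


def mutual_coherence(X):
    """Largest |off-diagonal| entry of X^T X / n (theta_{1,1} for unit diagonal)."""
    G = gram_matrix(X)
    M = len(G)
    return max(abs(G[j][k]) for j in range(M) for k in range(M) if j != k)


def coherence_condition_holds(theta, s, c0):
    """Check (3.9): theta < 1 / ((1 + 2 c0) s)."""
    return theta < 1.0 / ((1.0 + 2.0 * c0) * s)


# Three mutually orthogonal +/-1 columns (n = 4): coherence 0.
X_orth = [
    [1, 1, 1],
    [1, -1, 1],
    [1, 1, -1],
    [1, -1, -1],
]

# Same design plus a fourth column correlated with the other three.
X_corr = [
    [1, 1, 1, 1],
    [1, -1, 1, 1],
    [1, 1, -1, 1],
    [1, -1, -1, -1],
]

theta_orth = mutual_coherence(X_orth)
theta_corr = mutual_coherence(X_corr)
print(theta_orth, coherence_condition_holds(theta_orth, s=3, c0=3.0))
print(theta_corr, coherence_condition_holds(theta_corr, s=1, c0=1.0))
```

In the second design the coherence is too large for (3.9) even with $s = 1$ and $c_0 = 1$, which illustrates how restrictive a coherence-based condition can be compared with the RE assumptions it implies.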

Unfortunately, Assumption RE$(s, c_0)$ has some weakness. Let, for example, $f_j$, $j = 1, \dots, 2^m - 1$, be the Haar wavelet basis on $[0, 1]$ ($M = 2^m$) and consider $Z_i = i/n$, $i = 1, \dots, n$. If $M \gg n$, it is clear that $\phi_{\min}(1) = 0$ since there are functions $f_j$ on the highest resolution level whose supports (of length $M^{-1}$) contain no points $Z_i$. So, none of the Assumptions 1-4 holds. Intuitively, the problem arises only because we include very high resolution components. Therefore, we may try to restrict the set $J_0$ in RE$(s, c_0)$ to low resolution components, which is quite reasonable because the "true" or "interesting" vectors of parameters $\lambda$ are often characterized by such $J_0$. This idea is formalized in Section 5, cf. Corollary 1, see also a remark after Theorem 6.2 in Section 6.
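The Haar pathology can be reproduced on a tiny grid. The following is a minimal sketch, assuming the standard parametrization $2^{\ell/2}\psi(2^\ell z - k)$ of Haar wavelets by a level $\ell$ and a shift $k$ (this indexing, the function names and the choice $n = 8$ are illustrative and not taken from the text). Since $\phi_{\min}(1)$ equals the smallest squared empirical norm $\|f_j\|_n^2$ over the dictionary, one dictionary element with zero empirical norm is enough to force $\phi_{\min}(1) = 0$.

```python
def haar(level, k, z):
    """Haar wavelet 2^(level/2) * psi(2^level * z - k), psi = +1 on [0, 1/2), -1 on [1/2, 1)."""
    t = (2 ** level) * z - k
    scale = 2.0 ** (level / 2.0)
    if 0.0 <= t < 0.5:
        return scale
    if 0.5 <= t < 1.0:
        return -scale
    return 0.0


def empirical_norm_sq(g, n):
    """Squared empirical norm ||g||_n^2 on the design Z_i = i/n, i = 1, ..., n."""
    return sum(g(i / n) ** 2 for i in range(1, n + 1)) / n


n = 8
# Low resolution: support [0, 1/2) contains several design points.
coarse = empirical_norm_sq(lambda z: haar(1, 0, z), n)
# High resolution: support [1/32, 2/32) contains no point of the form i/8.
fine = empirical_norm_sq(lambda z: haar(5, 1, z), n)

print("coarse-level squared norm:", coarse)
print("fine-level squared norm:", fine)
```

The fine-level function vanishes at every design point, so the corresponding column of $X$ is zero; restricting $J_0$ to coarse levels, as proposed in the text, avoids such columns.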



4. Approximate equivalence. In this section we prove a type of approximate equivalence between Lasso and Dantzig selector. It is expressed as closeness of the prediction losses $\|\hat f_D - f\|_n^2$ and $\|\hat f_L - f\|_n^2$ when the number of non-zero components of Lasso or Dantzig selector is small as compared to the sample size.

Theorem 4.1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Let Assumption RE$(s, 1)$ be satisfied with $1 \le s \le M$. Consider the Dantzig estimator $\hat f_D$ defined by (2.4) - (2.5) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

and the Lasso estimator $\hat f_L$ defined by (2.2) - (2.3) with the same $r$. Then, for all $n \ge 1$ and $A > \sqrt{2}$, with probability at least $1 - M^{1 - A^2/2}$ we have that if $M(\hat\lambda_L) \le s$ then

(4.1)  $\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + \frac{16 A^2 f_{\max}^2 \sigma^2}{\kappa^2} \Big( \frac{M(\hat\lambda_L) \log M}{n} \Big)$

and for $A > 2\sqrt{2}$, with probability at least $1 - M^{1 - A^2/8}$ we have that if $M(\hat\lambda_L) \le s$,

(4.2)  $\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + \frac{9 A^2 f_{\max}^2 \sigma^2}{\kappa^2} \Big( \frac{M(\hat\lambda_L) \log M}{n} \Big)$

where $\kappa = \kappa(s, 1)$.



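Proof. Set $\Delta = \hat\lambda_L - \hat\lambda_D$. We have

$\frac{1}{n} |\mathbf{f} - X\hat\lambda_L|_2^2 = \frac{1}{n} |\mathbf{f} - X\hat\lambda_D|_2^2 - \frac{2}{n} \Delta^T X^T (\mathbf{f} - X\hat\lambda_D) + \frac{1}{n} |X\Delta|_2^2.$

This and (2.12) yield

(4.3)  $\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + 2 |\Delta|_1 \Big| \frac{1}{n} X^T (\mathbf{f} - X\hat\lambda_D) \Big|_\infty - \frac{1}{n} |X\Delta|_2^2$
       $\le \|\hat f_L - f\|_n^2 + 4 f_{\max} r |\Delta|_1 - \frac{1}{n} |X\Delta|_2^2$

where the last inequality holds with probability at least $1 - M^{1 - A^2/2}$. Since the Lasso solution $\hat\lambda_L$ satisfies the Dantzig constraint, we can apply Lemma 2 with $\lambda = \hat\lambda_L$, which yields

(4.4)  $|\Delta_{J_0^c}|_1 \le |\Delta_{J_0}|_1$

with $J_0 = J(\hat\lambda_L)$. By Assumption RE$(s, 1)$ we get

(4.5)  $\frac{1}{\sqrt{n}} |X\Delta|_2 \ge \kappa |\Delta_{J_0}|_2$

where $\kappa = \kappa(s, 1)$. Using (4.4) and (4.5) we obtain

(4.6)  $|\Delta|_1 \le 2 |\Delta_{J_0}|_1 \le 2 M^{1/2}(\hat\lambda_L) |\Delta_{J_0}|_2 \le \frac{2 M^{1/2}(\hat\lambda_L)}{\kappa \sqrt{n}} |X\Delta|_2.$

Finally, from (4.3) and (4.6) we get that, with probability at least $1 - M^{1 - A^2/2}$,

$\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + \frac{8 f_{\max} r M^{1/2}(\hat\lambda_L)}{\kappa \sqrt{n}} |X\Delta|_2 - \frac{1}{n} |X\Delta|_2^2 \le \|\hat f_L - f\|_n^2 + \frac{16 f_{\max}^2 r^2 M(\hat\lambda_L)}{\kappa^2}$

by (2.8) and (2.12), where the last step uses $at - t^2 \le a^2/4$ with $t = |X\Delta|_2 / \sqrt{n}$. This proves (4.1).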

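To show (4.2) we act as in (4.3), up to the inversion of roles of $\hat\lambda_L$ and $\hat\lambda_D$, and we use (2.8). This yields that, with probability at least $1 - M^{1 - A^2/8}$,

(4.7)  $\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + 2 |\Delta|_1 \Big| \frac{1}{n} X^T (\mathbf{f} - X\hat\lambda_L) \Big|_\infty - \frac{1}{n} |X\Delta|_2^2$
       $\le \|\hat f_D - f\|_n^2 + 3 f_{\max} r |\Delta|_1 - \frac{1}{n} |X\Delta|_2^2.$

The proof of (4.2) now parallels that of (4.1) up to a difference in numerical constants.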



We also have the following result that we state for simplicity under the assumption that $\|f_j\|_n = 1$, $j = 1, \dots, M$.

Theorem 4.2. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$, and let $\|f_j\|_n = 1$, $j = 1, \dots, M$. Let Assumption RE$(s, 5)$ be satisfied for some $1 \le s \le M$. Consider the Dantzig estimator $\hat f_D$ defined by (2.4) - (2.5) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

and $A > 2\sqrt{2}$. Let $\hat f_L$ be the Lasso estimator defined by (2.2) - (2.3) with the same $r$. Then, for all $n \ge 1$, with probability at least $1 - M^{1 - A^2/8}$ we have that if $M(\hat\lambda_D) \le s$ then

$\|\hat f_L - f\|_n^2 \le 10 \|\hat f_D - f\|_n^2 + \frac{81 A^2 \sigma^2}{\kappa^2} \Big( \frac{M(\hat\lambda_D) \log M}{n} \Big)$

where $\kappa = \kappa(s, 5)$.



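Proof. Set again $\Delta = \hat\lambda_L - \hat\lambda_D$. We apply (2.7) with $\lambda = \hat\lambda_D$ which yields that, with probability at least $1 - M^{1 - A^2/8}$,

(4.8)  $|\Delta|_1 \le 4 |\Delta_{J_0}|_1 + \|\hat f_D - f\|_n^2 / r$

where now $J_0 = J(\hat\lambda_D)$. Consider the two cases: (i) $\|\hat f_D - f\|_n^2 > 2r |\Delta_{J_0}|_1$ and (ii) $\|\hat f_D - f\|_n^2 \le 2r |\Delta_{J_0}|_1$. In case (i) inequality (4.7) with $f_{\max} = 1$ immediately implies

(4.9)  $\|\hat f_L - f\|_n^2 \le 10 \|\hat f_D - f\|_n^2$

and the theorem follows. In case (ii) we get from (4.8) that

(4.10)  $|\Delta|_1 \le 6 |\Delta_{J_0}|_1$

and thus $|\Delta_{J_0^c}|_1 \le 5 |\Delta_{J_0}|_1$. We can therefore apply Assumption RE$(s, 5)$ which yields, similarly to (4.6),

(4.11)  $|\Delta|_1 \le 6 M^{1/2}(\hat\lambda_D) |\Delta_{J_0}|_2 \le \frac{6 M^{1/2}(\hat\lambda_D)}{\kappa \sqrt{n}} |X\Delta|_2,$

where $\kappa = \kappa(s, 5)$. Plugging (4.11) into (4.7) we finally get that, in case (ii),

(4.12)  $\|\hat f_L - f\|_n^2 \le \|\hat f_D - f\|_n^2 + \frac{18 r M^{1/2}(\hat\lambda_D)}{\kappa \sqrt{n}} |X\Delta|_2 - \frac{1}{n} |X\Delta|_2^2 \le \|\hat f_D - f\|_n^2 + \frac{81 r^2 M(\hat\lambda_D)}{\kappa^2}.$

Remark. The approximate equivalence is essentially that of the rates, as Theorem 4.1 exhibits. A statement free of $M(\hat\lambda)$ holds for linear regression, see the discussion after Theorem 6.2 and Theorem 6.3 below.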



5. Oracle inequalities for prediction loss. Here we prove sparsity oracle inequalities for the prediction loss of Lasso and Dantzig estimators. A general discussion of sparsity oracle inequalities can be found in [21]. Such inequalities have been recently obtained for the Lasso type estimators in a number of settings [2-6, 13, 23]. In particular, the regression model with fixed design that we study here is considered in [2-4]. The assumptions on the Gram matrix $\Psi_n$ in [2-4] are more restrictive than ours: in those papers either $\Psi_n$ is positive definite or a mutual coherence condition similar to (3.9) is imposed.

Theorem 5.1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix some $\varepsilon > 0$ and an integer $1 \le s \le M$. Let Assumption RE$(s, c_0)$ be satisfied with $c_0 = 3 + 4/\varepsilon$. Consider the Lasso estimator $\hat f_L$ defined by (2.2) - (2.3) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

for some $A > 2\sqrt{2}$. Then, for all $n \ge 1$, with probability at least $1 - M^{1 - A^2/8}$ we have

(5.1)  $\|\hat f_L - f\|_n^2 \le (1 + \varepsilon) \inf_{\lambda \in \mathbb{R}^M : M(\lambda) \le s} \Big\{ \|f_\lambda - f\|_n^2 + \frac{C(\varepsilon) f_{\max}^2 A^2 \sigma^2}{\kappa^2} \Big( \frac{M(\lambda) \log M}{n} \Big) \Big\}$

where $\kappa = \kappa(s, 3 + 4/\varepsilon)$ and $C(\varepsilon) > 0$ is a constant depending only on $\varepsilon$.

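Proof. Fix an arbitrary $\lambda \in \mathbb{R}^M$ with $M(\lambda) \le s$. Set $\Delta = D^{1/2}(\hat\lambda_L - \lambda)$, $J_0 = J(\lambda)$. On the event $\mathcal{A}$ where (2.7) holds, we get from the first line in (2.7) that

(5.2)  $\|\hat f_L - f\|_n^2 + r |\Delta|_1 \le \|f_\lambda - f\|_n^2 + 4r \sum_{j \in J_0} \|f_j\|_n |\hat\lambda_{j,L} - \lambda_j|$
       $= \|f_\lambda - f\|_n^2 + 4r |\Delta_{J_0}|_1,$

and from the second line in (2.7) that

(5.3)  $\|\hat f_L - f\|_n^2 \le \|f_\lambda - f\|_n^2 + 4r \sqrt{M(\lambda)}\, |\Delta_{J_0}|_2.$

Consider separately the cases where

(5.4)  $4r |\Delta_{J_0}|_1 \le \varepsilon \|f_\lambda - f\|_n^2$

and

(5.5)  $\varepsilon \|f_\lambda - f\|_n^2 < 4r |\Delta_{J_0}|_1.$

In case (5.4), the result of the theorem trivially follows from (5.2). So, we will only consider the case (5.5). All the subsequent inequalities are valid on the event $\mathcal{A} \cap \mathcal{A}_1$ where $\mathcal{A}_1$ is defined by (5.5). On this event we get from (5.2) that

(5.6)  $|\Delta|_1 \le 4(1 + 1/\varepsilon) |\Delta_{J_0}|_1$

which implies $|\Delta_{J_0^c}|_1 \le (3 + 4/\varepsilon) |\Delta_{J_0}|_1$. We now use Assumption RE$(s, 3 + 4/\varepsilon)$. This yields

(5.7)  $\kappa^2 |\Delta_{J_0}|_2^2 \le \frac{1}{n} |X\Delta|_2^2 = \frac{1}{n} (\hat\lambda_L - \lambda)^T D^{1/2} X^T X D^{1/2} (\hat\lambda_L - \lambda)$
       $\le \frac{f_{\max}^2}{n} (\hat\lambda_L - \lambda)^T X^T X (\hat\lambda_L - \lambda) = f_{\max}^2 \|\hat f_L - f_\lambda\|_n^2$

where $\kappa = \kappa(s, 3 + 4/\varepsilon)$. Combining this with (5.3) we find

(5.8)  $\|\hat f_L - f\|_n^2 \le \|f_\lambda - f\|_n^2 + 4r f_{\max} \kappa^{-1} \sqrt{M(\lambda)}\, \|\hat f_L - f_\lambda\|_n$
       $\le \|f_\lambda - f\|_n^2 + 4r f_{\max} \kappa^{-1} \sqrt{M(\lambda)}\, \big( \|\hat f_L - f\|_n + \|f_\lambda - f\|_n \big).$

This inequality is of the same form as (A.4) in [4]. A standard decoupling argument as in [4] using the inequality $2xy \le x^2/b + b y^2$ with $b > 1$, $x = r \kappa^{-1} \sqrt{M(\lambda)}$, and $y$ being either $\|\hat f_L - f\|_n$ or $\|f_\lambda - f\|_n$ yields that

(5.9)  $\|\hat f_L - f\|_n^2 \le \frac{b + 1}{b - 1} \|f_\lambda - f\|_n^2 + \frac{8 b^2 f_{\max}^2}{(b - 1) \kappa^2}\, r^2 M(\lambda), \quad \forall\, b > 1.$

Taking $b = 1 + 2/\varepsilon$ in the last display finishes the proof of the theorem.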
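For readers who want the decoupling step spelled out, the following sketch shows one way to pass from (5.8) to (5.9). The shorthand $a$, $c$, $u$ is introduced here for illustration only and does not appear in the source; the elementary inequality is used in the equivalent form $2uv \le b u^2 + v^2/b$.

```latex
% Shorthand (assumption of this sketch):
%   a = \|\hat f_L - f\|_n , \quad c = \|f_\lambda - f\|_n , \quad
%   u = 2 f_{\max}\, r\, \kappa^{-1} \sqrt{M(\lambda)} .
% Then (5.8) reads  a^2 \le c^2 + 2ua + 2uc .
% Apply  2uv \le b u^2 + v^2/b  (b > 1)  once with v = a and once with v = c:
a^2 \;\le\; c^2 + \frac{a^2}{b} + \frac{c^2}{b} + 2 b u^2 ,
% hence
\Big(1 - \frac{1}{b}\Big) a^2 \;\le\; \Big(1 + \frac{1}{b}\Big) c^2 + 2 b u^2
\quad\Longrightarrow\quad
a^2 \;\le\; \frac{b+1}{b-1}\, c^2 + \frac{2 b^2}{b-1}\, u^2
      \;=\; \frac{b+1}{b-1}\, c^2 + \frac{8 b^2 f_{\max}^2}{(b-1)\,\kappa^2}\, r^2 M(\lambda) .
% With b = 1 + 2/\varepsilon the leading factor is (b+1)/(b-1) = 1 + \varepsilon .
```

The choice of $b$ trades the factor in front of the approximation error against the constant in front of the sparsity term: $b \to \infty$ (that is, $\varepsilon \to 0$) makes the leading factor approach 1 while the second constant grows.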



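We now state as a corollary a softer version of Theorem 5.1 that can be used to eliminate the pathologies mentioned at the end of Section 3. For this purpose we define

$\mathcal{J}_{s,\gamma,c_0} = \Big\{ J_0 \subset \{1, \dots, M\} : |J_0| \le s \ \text{and} \ \min_{\Delta \ne 0 : |\Delta_{J_0^c}|_1 \le c_0 |\Delta_{J_0}|_1} \frac{|X\Delta|_2}{\sqrt{n}\,|\Delta_{J_0}|_2} \ge \gamma \Big\}$

where $\gamma > 0$ is a constant, and set

$\Lambda_{s,\gamma,c_0} = \{\lambda : J(\lambda) \in \mathcal{J}_{s,\gamma,c_0}\}.$

In a similar way, we define $\mathcal{J}_{s,\gamma,m,c_0}$ and $\Lambda_{s,\gamma,m,c_0}$ corresponding to Assumption RE$(s, m, c_0)$.

Corollary 1. Let $W_i$, $s$ and the Lasso estimator $\hat f_L$ be the same as in Theorem 5.1. Then, for all $n \ge 1$ and $\varepsilon > 0$, $\gamma > 0$, with probability at least $1 - M^{1 - A^2/8}$ we have

(5.10)  $\|\hat f_L - f\|_n^2 \le (1 + \varepsilon) \inf_{\lambda \in \bar\Lambda_{s,\gamma,\varepsilon}} \Big\{ \|f_\lambda - f\|_n^2 + \frac{C(\varepsilon) f_{\max}^2 A^2 \sigma^2}{\gamma^2} \Big( \frac{M(\lambda) \log M}{n} \Big) \Big\}$

where $\bar\Lambda_{s,\gamma,\varepsilon} = \{\lambda \in \Lambda_{s,\gamma,3+4/\varepsilon} : M(\lambda) \le s\}$.

To obtain this corollary it suffices to observe that the proof of Theorem 5.1 goes through if we drop Assumption RE$(s, c_0)$ but we assume instead that $\lambda \in \Lambda_{s,\gamma,3+4/\varepsilon}$ and we replace $\kappa$ by $\gamma$.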
We would like now to get a sparsity oracle inequality similar to that of Theorem 5.1 for the Dantzig estimator $\hat f_D$. We will need a mild additional assumption on $f$. This is due to the fact that not every $\lambda \in \mathbb{R}^M$ obeys the Dantzig constraint, and thus we cannot assure the key relation (2.11) for all $\lambda \in \mathbb{R}^M$. One possibility would be to prove an inequality such as (5.1) where the infimum on the right hand side is taken over $\lambda$ satisfying not only $M(\lambda) \le s$ but also the Dantzig constraint. However, this seems not very intuitive since we cannot guarantee that the corresponding $f_\lambda$ gives a good approximation of the unknown function $f$. Therefore we choose another approach (cf. [5]): we consider $f$ satisfying the weak sparsity property relative to the dictionary $f_1, \dots, f_M$. That is, we assume that there exist an integer $s$ and a constant $C_0 < \infty$ such that the set

(5.11)  $\Lambda_s = \Big\{ \lambda \in \mathbb{R}^M : M(\lambda) \le s, \ \|f_\lambda - f\|_n^2 \le \frac{C_0 f_{\max}^2 r^2}{\kappa^2} M(\lambda) \Big\}$

is non-empty. Here $\kappa$ is the same as in Theorem 5.1. The second inequality in (5.11) says that the "bias" term $\|f_\lambda - f\|_n^2$ cannot be much larger than the "variance term" $\sim f_{\max}^2 r^2 \kappa^{-2} M(\lambda)$, cf. (5.1). Weak sparsity is milder than the sparsity property in the usual sense: the latter means that $f$ admits the exact representation $f = f_{\lambda^*}$ for some $\lambda^* \in \mathbb{R}^M$, with hopefully small $M(\lambda^*) = s$.



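Corollary 2. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix some $\varepsilon > 0$. Let $f$ obey the weak sparsity assumption for some $C_0 < \infty$ and some $s$ such that $1 \le \max(C_1(\varepsilon), 1) s \le M$ where

$C_1(\varepsilon) = 4 [(1 + \varepsilon) C_0 + C(\varepsilon)] \frac{\phi_{\max} f_{\max}^2}{\kappa^2 f_{\min}^2}$

and $C(\varepsilon)$ is the constant in Theorem 5.1. Let Assumption RE$(\max(C_1(\varepsilon), 1) s, c_0)$ be satisfied with $c_0 = 3 + 4/\varepsilon$. Consider the Dantzig estimator $\hat f_D$ defined by (2.4) - (2.5) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

and $A > 2\sqrt{2}$. Then, for all $n \ge 1$, with probability at least $1 - M^{1 - A^2/8}$ we have

(5.12)  $\|\hat f_D - f\|_n^2 \le (1 + \varepsilon) \inf_{\lambda \in \mathbb{R}^M : M(\lambda) = s} \|f_\lambda - f\|_n^2 + C_2(\varepsilon) \frac{f_{\max}^2 A^2 \sigma^2}{\kappa_0^2} \Big( \frac{s \log M}{n} \Big).$

Here $C_2(\varepsilon) = 16 C_1(\varepsilon) + C(\varepsilon)$ and $\kappa_0 = \kappa(\max(C_1(\varepsilon), 1) s, 3 + 4/\varepsilon)$.

Proof. Due to the weak sparsity assumption there exists $\bar\lambda \in \mathbb{R}^M$ with $M(\bar\lambda) \le s$ such that $\|f_{\bar\lambda} - f\|_n^2 \le C_0 f_{\max}^2 r^2 \kappa^{-2} M(\bar\lambda)$ where $\kappa = \kappa(s, 3 + 4/\varepsilon)$ is the same as in Theorem 5.1. Using this together with Theorem 5.1 and (2.9) we obtain that, with probability at least $1 - M^{1 - A^2/8}$,

$M(\hat\lambda_L) \le C_1(\varepsilon) M(\bar\lambda) \le C_1(\varepsilon) s.$

This and Theorem 4.1 imply

$\|\hat f_D - f\|_n^2 \le \|\hat f_L - f\|_n^2 + \frac{16 C_1(\varepsilon) f_{\max}^2 A^2 \sigma^2}{\kappa_0^2} \Big( \frac{s \log M}{n} \Big)$

where $\kappa_0 = \kappa(\max(C_1(\varepsilon), 1) s, 3 + 4/\varepsilon)$. Applying Theorem 5.1 once again we get the result.

Note that the sparsity oracle inequality (5.12) is slightly weaker than the analogous inequality (5.1) for the Lasso: we have here $\inf_{\lambda \in \mathbb{R}^M : M(\lambda) = s}$ instead of $\inf_{\lambda \in \mathbb{R}^M : M(\lambda) \le s}$ in (5.1).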

6. Special case: linear regression. In this section we assume that the vector of observations $y = (Y_1, \dots, Y_n)^T$ is of the form

(6.1)  $y = X\beta^* + W$

where $X$ is an $n \times M$ deterministic matrix, $\beta^* \in \mathbb{R}^M$ and $W = (W_1, \dots, W_n)^T$. We do not assume that $\beta^*$ is uniquely defined. On the contrary, we expect to have $M$ at least of order of $n$ and typically much larger. In this case, if $\beta^* = \beta_0^*$ satisfies (6.1) there exists an $(M - n)$-dimensional affine space $\{\beta^* : X\beta^* = X\beta_0^*\}$ of vectors satisfying (6.1). The results of this section are valid for any $\beta^*$ such that (6.1) holds, in particular, for $\beta^{**}$ that gives the sparsest representation of $E(y)$, i.e., such that

$\beta^{**} = \arg\min_{\beta^* : X\beta^* = E(y)} M(\beta^*).$

Our goal is to estimate both $X\beta^*$ for purposes of prediction and $\beta^*$ itself for purposes of model selection. We will see that meaningful results are obtained when the sparsity index $M(\beta^*)$ is small.

It will be assumed throughout this section that the diagonal elements of the matrix $X^T X/n$ are all equal to 1 (this is equivalent to the condition $\|f_j\|_n = 1$, $j = 1, \dots, M$, in the notation of previous sections). Then the Lasso estimator of $\beta^*$ in (6.1) is defined by

(6.2)  $\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^M} \Big\{ \frac{1}{n} |y - X\beta|_2^2 + 2r |\beta|_1 \Big\}.$

The correspondence between the notation here and that of the previous sections is the following: for $\beta \equiv \lambda$ we have

$\|f_\lambda\|_n^2 = |X\beta|_2^2 / n, \quad \|f_\lambda - f\|_n^2 = |X(\beta - \beta^*)|_2^2 / n, \quad \|\hat f_L - f\|_n^2 = |X(\hat\beta_L - \beta^*)|_2^2 / n.$

The Dantzig selector for linear model (6.1) is defined by

(6.3)  $\hat\beta_D = \arg\min_{\beta \in \Lambda} |\beta|_1$

where

$\Lambda = \Big\{ \beta \in \mathbb{R}^M : \Big| \frac{1}{n} X^T (y - X\beta) \Big|_\infty \le r \Big\}$

is the set of all $\beta$ satisfying the Dantzig constraint.

We first get bounds on the rate of convergence of Dantzig selector.

Theorem 6.1. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$, let all the diagonal elements of the matrix $X^T X/n$ be equal to 1, and $M(\beta^*) = s$ where $1 \le s \le M$. Let Assumption RE$(s, 1)$ be satisfied. Consider the Dantzig selector $\hat\beta_D$ defined by (6.3) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

and $A > \sqrt{2}$. Then, for all $n \ge 1$, with probability at least $1 - M^{1 - A^2/2}$ we have

(6.4)  $|\hat\beta_D - \beta^*|_1 \le \frac{8A}{\kappa^2}\, \sigma s \sqrt{\frac{\log M}{n}},$

(6.5)  $|X(\hat\beta_D - \beta^*)|_2^2 \le \frac{16 A^2}{\kappa^2}\, \sigma^2 s \log M,$

where $\kappa = \kappa(s, 1)$. In addition, if Assumption RE$(s, m, 1)$ is satisfied, then with the same probability as above, simultaneously for all $1 < p \le 2$ we have

(6.6)  $|\hat\beta_D - \beta^*|_p^p \le 2^{p-1}\, 8 \Big\{ 1 + \sqrt{\frac{s}{m}} \Big\}^{2(p-1)} s \Big( \frac{A\sigma}{\kappa^2} \sqrt{\frac{\log M}{n}} \Big)^p$

where $\kappa = \kappa(s, m, 1)$.

Note that, since $s \le m$, the factor in curly brackets in (6.6) is bounded by a constant independent of $s$ and $m$. Under Assumption 1 with $c_0 = 1$ (which is less general than RE$(s, s, 1)$, cf. Lemma 3(i)) a bound of the form (6.6) for the case $p = 2$ is established by Candes and Tao [7].
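To get a feeling for the orders of magnitude in Theorem 6.1, the right-hand sides of (6.4) and (6.5) and the probability level can be evaluated for given problem sizes. The following is a minimal sketch; the function names are illustrative, and the restricted eigenvalue $\kappa$ is treated as a known input, which is an assumption of this example since computing $\kappa(s, 1)$ for a given design is not addressed in the text.

```python
import math


def tuning_constant(A, sigma, M, n):
    """r = A * sigma * sqrt(log(M) / n), the threshold used in Theorem 6.1."""
    return A * sigma * math.sqrt(math.log(M) / n)


def dantzig_bounds(A, sigma, s, M, n, kappa):
    """Right-hand sides of (6.4) and (6.5) and the probability level of Theorem 6.1."""
    r = tuning_constant(A, sigma, M, n)
    l1_bound = 8.0 * r * s / kappa ** 2               # (8A / kappa^2) sigma s sqrt(log M / n)
    pred_bound = 16.0 * n * r ** 2 * s / kappa ** 2   # (16 A^2 / kappa^2) sigma^2 s log M
    prob = 1.0 - M ** (1.0 - A ** 2 / 2.0)
    return r, l1_bound, pred_bound, prob


# Illustrative values: M = 100 variables, n = 400 observations, sparsity s = 5,
# and a hypothetical restricted eigenvalue kappa = 0.5.
r, l1_bound, pred_bound, prob = dantzig_bounds(
    A=2.0, sigma=1.0, s=5, M=100, n=400, kappa=0.5
)
print("r =", r)
print("l1 bound (6.4) =", l1_bound)
print("prediction bound (6.5) =", pred_bound)
print("probability level =", prob)
```

Note that (6.5) bounds the unnormalized quantity $|X(\hat\beta_D - \beta^*)|_2^2$; dividing by $n$ gives the bound on the prediction loss in the empirical norm used in the earlier sections.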

Bounds on the rate of convergence of the Lasso selector are quite similar 
to those obtained in Theorem 6.1. They are given by the following result. 

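Theorem 6.2. Let $W_i$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Let all the diagonal elements of the matrix $X^T X/n$ be equal to 1, and $M(\beta^*) = s$ where $1 \le s \le M$. Let Assumption RE$(s, 3)$ be satisfied. Consider the Lasso selector $\hat\beta_L$ defined by (6.2) with

$r = A\sigma \sqrt{\frac{\log M}{n}}$

and $A > 2\sqrt{2}$. Then, for all $n \ge 1$, with probability at least $1 - M^{1 - A^2/8}$ we have

(6.7)  $|\hat\beta_L - \beta^*|_1 \le \frac{16 A}{\kappa^2}\, \sigma s \sqrt{\frac{\log M}{n}},$

(6.8)  $|X(\hat\beta_L - \beta^*)|_2^2 \le \frac{16 A^2}{\kappa^2}\, \sigma^2 s \log M,$

(6.9)  $M(\hat\beta_L) \le \frac{64 \phi_{\max}}{\kappa^2}\, s,$

where $\kappa = \kappa(s, 3)$. In addition, if Assumption RE$(s, m, 3)$ is satisfied, then with the same probability as above, simultaneously for all $1 < p \le 2$ we have

(6.10)  $|\hat\beta_L - \beta^*|_p^p \le 16 \Big\{ 1 + 3 \sqrt{\frac{s}{m}} \Big\}^{2(p-1)} s \Big( \frac{A\sigma}{\kappa^2} \sqrt{\frac{\log M}{n}} \Big)^p$

where $\kappa = \kappa(s, m, 3)$.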



Assumptions RE(s,1) and RE(s,3), respectively, can be dropped in Theorems 
6.1 and 6.2 if we assume $\beta^* \in \Lambda_{s,\gamma,c_0}$ with $c_0 = 1$ or $c_0 = 3$ as 
appropriate. Then (6.4), (6.5) or respectively (6.7), (6.8) hold with $\kappa = \gamma$. 
This is analogous to Corollary 1. Similarly (6.6) and (6.10) hold with $\kappa = \gamma$ if 
$\beta^* \in \Lambda_{s,\gamma,m,c_0}$ with $c_0 = 1$ or $c_0 = 3$ as appropriate. 

Observe that combining Theorems 6.1 and 6.2 we can immediately get 
bounds for the differences between Lasso and Dantzig selectors 
$|\hat\beta_L - \hat\beta_D|_p^p$ and $|X(\hat\beta_L - \hat\beta_D)|_2^2$. Such bounds have the 
same form as those of Theorems 6.1 and 6.2, up to numerical constants. 
Another way of estimating these differences follows directly from the proof 
of Theorem 6.1. It suffices to observe that the only property of $\beta^*$ used 
in that proof is the fact that $\beta^*$ satisfies the Dantzig constraint, which is 
also true for the Lasso solution $\hat\beta_L$. So, we can replace $\beta^*$ by $\hat\beta_L$ 
and s by $M(\hat\beta_L)$ everywhere in Theorem 6.1. Generalizing a bit more, we 
easily derive the following fact. 

Theorem 6.3. The result of Theorem 6.1 remains valid if we replace 
there $|\hat\beta_D - \beta^*|_p^p$ by $\sup\{|\hat\beta_D - \beta|_p^p : \beta \in \Lambda,\ M(\beta) = s\}$ for 
$1 \le p \le 2$ and $|X(\hat\beta_D - \beta^*)|_2^2$ by 
$\sup\{|X(\hat\beta_D - \beta)|_2^2 : \beta \in \Lambda,\ M(\beta) = s\}$ respectively. Here 
$\Lambda$ is the set of all vectors satisfying the Dantzig constraint. 

Remarks. 

1. We would like to emphasize that Theorems 6.1 and 6.2 are true for any 
$\beta^*$ satisfying (6.1), in particular, when the parameter $\beta^*$ is 
non-identifiable. Moreover, Theorem 6.3 applies to certain values of $\beta$ that 
do not come from the model (6.1) at all. Note that Assumptions RE(s,1) and 
RE(s,m,1) do not imply identifiability. In fact, they do not guarantee that 
$\phi_{\min}(2s) > 0$, which is an evident necessary condition for identifiability, 
cf. [7]. The lack of identifiability is not a contradiction, even when we deal 
with the $\ell_p$ loss on the coefficients. Indeed, Theorems 6.1 and 6.2 only give 
non-asymptotic upper bounds on the loss, with some probability and under 
some conditions. The probability depends on M and the conditions depend on 
n and M: recall that Assumptions RE(s,1) and RE(s,m,1) are imposed on the 
$n \times M$ matrix X. To deduce asymptotic convergence (as $n \to \infty$ and/or as 
$M \to \infty$) from Theorems 6.1 and 6.2 we would need some very strong 
additional properties, such as simultaneous validity of Assumption RE(s,1) or 
RE(s,m,1) (with one and the same constant $\kappa$) for infinitely many n and M. 

In particular, we see that the identifiability argument emphasized by 
Candes and Tao [7] to justify a qualified positivity of $\phi_{\min}(2s)$ in their 
conditions is not really a matter of importance. We get the same and more 
general results without identifiability. What is more, we can use Theorems 
6.1-6.3 in a paradoxical way, aiming to deduce some geometric facts from 
probabilistic statements, for example: "in very high dimensions M and for 
reasonably large sample sizes n the set of all very sparse vectors $\beta^*$ 
satisfying the model (6.1) is necessarily very well concentrated". 

2. For the smallest value of A (which is $A = 2\sqrt{2}$) the constants in the 
bound of Theorem 6.2 for the Lasso are larger than the corresponding 
numerical constants for the Dantzig selector given in Theorem 6.1, again for 
the smallest admissible value $A = \sqrt{2}$. There is not much margin for 
improvement, which probably suggests that for the parametric linear model 
(6.1), under the assumption that all the diagonal elements of the matrix 
$X^TX/n$ are equal to 1, the Dantzig selector might be better than the Lasso. 
However, this remark should be considered with caution, since Theorems 
6.1 and 6.2 only give upper bounds. Note also that the Dantzig selector has 
certain defects as compared to the Lasso when the model is nonparametric, as 
discussed in Section 5. In particular, to obtain sparsity oracle inequalities 
for the Dantzig selector we need some restrictions on f, for example the weak 
sparsity property. On the other hand, sparsity oracle inequality (5.1) for the 
Lasso is valid with no restriction on f. 
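
The comparison of constants in Remark 2 can be made concrete with a short computation. The sketch below is an illustration only: it compares the numerical factors in (6.4) versus (6.7) and in (6.5) versus (6.8) at the boundary values $A = \sqrt{2}$ and $A = 2\sqrt{2}$ (which are infima, not admissible values), and it ignores the difference between $\kappa(s,1)$ and $\kappa(s,3)$. The function names are hypothetical.

```python
import math

def l1_constant(method, A):
    """Numerical factor in front of sigma*s*sqrt(log M / n)/kappa^2 in (6.4) or (6.7)."""
    return {"dantzig": 8.0, "lasso": 16.0}[method] * A

def prediction_constant(A):
    """Numerical factor in front of sigma^2*s*log M/kappa^2 in (6.5) and (6.8)."""
    return 16.0 * A ** 2

A_dantzig = math.sqrt(2.0)        # infimum of admissible A in Theorem 6.1
A_lasso = 2.0 * math.sqrt(2.0)    # infimum of admissible A in Theorem 6.2

# Ratios Lasso / Dantzig of the numerical factors at the boundary values of A.
l1_ratio = l1_constant("lasso", A_lasso) / l1_constant("dantzig", A_dantzig)
pred_ratio = prediction_constant(A_lasso) / prediction_constant(A_dantzig)
```

Under these simplifications both ratios equal 4, so the Lasso bounds carry numerical constants four times larger than those of the Dantzig selector.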

3. Proofs of Theorems 6.1 and 6.2 differ mainly in the value of the tuning 
constant: $c_0 = 1$ in Theorem 6.1 and $c_0 = 3$ in Theorem 6.2. Note that since 
the Lasso solution satisfies the Dantzig constraint we could have obtained a 
result similar to Theorem 6.2, though with less accurate numerical constants, 
by simply conducting the proof of Theorem 6.1 with $c_0 = 3$. However, we act 
differently: we deduce (A.17) directly from (2.7), and not from (A.11). This 
is done only for the sake of improving the constants: in fact, using (A.11) 
with $c_0 = 3$ would yield (A.17) with the doubled constant on the right hand 
side. 

4. For the Dantzig selector in the linear regression model and under 
Assumptions 1 or 2 some further improvement of constants in the $\ell_p$ bounds 
for the coefficients can be achieved by applying the general version of 
Lemma 3 with the projector $P_{01}$ inside. We do not pursue this issue here. 

5. All our results are stated with probabilities at least $1 - M^{1-A^2/2}$ 
or $1 - M^{1-A^2/8}$. These are reasonable (but not the most accurate) lower 
bounds on the probabilities $P(\mathcal{B})$ and $P(\mathcal{A})$ respectively: we have chosen 
them just for readability. Inspection of (A.1) shows that they can be refined 
to $1 - 2M[1 - \Phi(A\sqrt{\log M})]$ and $1 - 2M[1 - \Phi(A\sqrt{\log M}/2)]$ respectively 
where $\Phi(\cdot)$ is the standard normal c.d.f. 
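
The gain from the refinement in Remark 5 can be examined numerically. The following sketch is an illustration under the assumption that the refined levels are $1 - 2M[1-\Phi(t)]$ with $t = A\sqrt{\log M}$ for the Dantzig selector and $t = A\sqrt{\log M}/2$ for the Lasso; the helper names are hypothetical, and the normal tail is computed with `math.erfc` from the Python standard library.

```python
import math

def std_normal_upper_tail(t):
    """1 - Phi(t) for the standard normal c.d.f. Phi, via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def simple_bound(M, A, divisor):
    """1 - M^(1 - A^2/divisor); divisor = 2 (Dantzig selector) or 8 (Lasso)."""
    return 1.0 - M ** (1.0 - A ** 2 / divisor)

def refined_bound(M, A, scale):
    """1 - 2M[1 - Phi(A sqrt(log M) / scale)]; scale = 1 (Dantzig selector) or 2 (Lasso)."""
    t = A * math.sqrt(math.log(M)) / scale
    return 1.0 - 2.0 * M * std_normal_upper_tail(t)

# Hypothetical setting: M = 1000, A = 2 for the Dantzig selector, A = 3 for the Lasso.
dantzig_pair = (simple_bound(1000, 2.0, 2), refined_bound(1000, 2.0, 1))
lasso_pair = (simple_bound(1000, 3.0, 8), refined_bound(1000, 3.0, 2))
```

Since $2[1-\Phi(t)] \le \exp(-t^2/2)$ for $t \ge 0$, the refined level is never smaller than the simple one.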



APPENDIX A: PROOFS 

Proof of Lemma 1. The result (2.7) is essentially Lemma 1 from [5]. For 
completeness, we give its proof. Set $r_{n,j} = r\|f_j\|_n$. By definition, 

$\hat S(\hat\lambda_L) + 2\sum_{j=1}^M r_{n,j}|\hat\lambda_{j,L}| \le \hat S(\lambda) + 2\sum_{j=1}^M r_{n,j}|\lambda_j|$

for all $\lambda \in \mathbb{R}^M$, which is equivalent to 

$\|\hat f_L - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}|\hat\lambda_{j,L}| \le \|f_\lambda - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}|\lambda_j| + \frac{2}{n}\sum_{i=1}^n W_i(\hat f_L - f_\lambda)(Z_i).$

Define the random variables $V_j = n^{-1}\sum_{i=1}^n f_j(Z_i)W_i$, $1 \le j \le M$, and the 
event 

$\mathcal{A} = \bigcap_{j=1}^M \{2|V_j| \le r_{n,j}\}.$

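Using an elementary bound on the tails of the Gaussian distribution we find 
that the probability of the complementary event $\mathcal{A}^c$ satisfies 

(A.1) $P\{\mathcal{A}^c\} \le \sum_{j=1}^M P\{\sqrt{n}|V_j| > \sqrt{n}\,r_{n,j}/2\} \le M\,P\{|\eta| \ge r\sqrt{n}/(2\sigma)\} \le M\exp\left(-\frac{nr^2}{8\sigma^2}\right) = M\exp\left(-\frac{A^2\log M}{8}\right) = M^{1-A^2/8}$

where $\eta \sim \mathcal{N}(0,1)$. On the event $\mathcal{A}$ we have 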

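$\|\hat f_L - f\|_n^2 \le \|f_\lambda - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\lambda_{j,L} - \lambda_j| + \sum_{j=1}^M 2r_{n,j}|\lambda_j| - \sum_{j=1}^M 2r_{n,j}|\hat\lambda_{j,L}|.$

Adding the term $\sum_{j=1}^M r_{n,j}|\hat\lambda_{j,L} - \lambda_j|$ to both sides of this inequality 
yields, on $\mathcal{A}$, 

$\|\hat f_L - f\|_n^2 + \sum_{j=1}^M r_{n,j}|\hat\lambda_{j,L} - \lambda_j| \le \|f_\lambda - f\|_n^2 + 2\sum_{j=1}^M r_{n,j}\left(|\hat\lambda_{j,L} - \lambda_j| + |\lambda_j| - |\hat\lambda_{j,L}|\right).$

Now, $|\hat\lambda_{j,L} - \lambda_j| + |\lambda_j| - |\hat\lambda_{j,L}| = 0$ for $j \notin J(\lambda)$, so that on $\mathcal{A}$ 
we get (2.7). To prove (2.8) it suffices to note that on $\mathcal{A}$ we have 

(A.2) $\left|\frac{1}{n} D^{-1/2} X^T W\right|_\infty \le r/2.$

Now, $y = \mathbf{f} + W$, and (2.8) follows from (2.6), (A.2). 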

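We finally prove (2.9). The necessary and sufficient condition for $\hat\lambda_L$ to 
be the Lasso solution can be written in the form 

(A.3) $\frac{1}{n} x_{(j)}^T (y - X\hat\lambda_L) = r\|f_j\|_n\, \mathrm{sign}(\hat\lambda_{j,L})$ if $\hat\lambda_{j,L} \ne 0$, 

$\left|\frac{1}{n} x_{(j)}^T (y - X\hat\lambda_L)\right| \le r\|f_j\|_n$ if $\hat\lambda_{j,L} = 0$, 

where $x_{(j)}$ denotes the jth column of X, $j = 1, \dots, M$. Next, (A.2) yields 
that on $\mathcal{A}$ we have 

(A.4) $\left|\frac{1}{n} x_{(j)}^T W\right| \le r\|f_j\|_n/2, \quad j = 1, \dots, M.$

Combining (A.3) and (A.4) we get 

(A.5) $\left|\frac{1}{n} x_{(j)}^T (\mathbf{f} - X\hat\lambda_L)\right| \ge r\|f_j\|_n/2$ if $\hat\lambda_{j,L} \ne 0$. 

Therefore 

$\frac{1}{n^2}(\mathbf{f} - X\hat\lambda_L)^T X X^T (\mathbf{f} - X\hat\lambda_L) = \frac{1}{n^2}\sum_{j=1}^M \left(x_{(j)}^T(\mathbf{f} - X\hat\lambda_L)\right)^2 \ge \frac{1}{n^2}\sum_{j:\,\hat\lambda_{j,L} \ne 0} \left(x_{(j)}^T(\mathbf{f} - X\hat\lambda_L)\right)^2 \ge \sum_{j:\,\hat\lambda_{j,L} \ne 0} r^2\|f_j\|_n^2/4 \ge f_{\min}^2 M(\hat\lambda_L)\, r^2/4.$

Since the matrices $X^TX/n$ and $XX^T/n$ have the same maximal eigenvalues, 

$\frac{1}{n^2}(\mathbf{f} - X\hat\lambda_L)^T X X^T (\mathbf{f} - X\hat\lambda_L) \le \frac{\phi_{\max}}{n}|\mathbf{f} - X\hat\lambda_L|_2^2 = \phi_{\max}\|f - \hat f_L\|_n^2$

and we deduce (2.9) from the last two displays. 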
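
The optimality condition (A.3) can be checked by hand in the simplest case M = 1, where the Lasso solution is given by soft-thresholding. The sketch below is an illustration only, assuming the scalar criterion $n^{-1}|y - x\lambda|_2^2 + 2r\|x\|_n|\lambda|$ with $\|x\|_n^2 = x^Tx/n$; the function names and the data are hypothetical.

```python
import math

def lasso_1d(x, y, r):
    """Minimizer of (1/n)|y - x*lam|_2^2 + 2*r*||x||_n*|lam| over scalar lam."""
    n = len(x)
    d = sum(v * v for v in x) / n                # ||x||_n^2
    z = sum(a * b for a, b in zip(x, y)) / n     # x^T y / n
    thresh = r * math.sqrt(d)                    # r * ||x||_n
    if abs(z) <= thresh:
        return 0.0
    # Soft-thresholding of z, rescaled by ||x||_n^2.
    return math.copysign(abs(z) - thresh, z) / d

def kkt_lhs(x, y, lam):
    """(1/n) x^T (y - x*lam), the left-hand side of (A.3) for M = 1."""
    n = len(x)
    return sum(a * (b - a * lam) for a, b in zip(x, y)) / n

x = [1.0, -1.0, 1.0, -1.0]     # ||x||_n = 1
y = [1.2, -0.8, 1.1, -0.9]
lam_active = lasso_1d(x, y, r=0.25)   # non-zero solution: equality case of (A.3)
lam_zero = lasso_1d(x, y, r=2.0)      # zero solution: inequality case of (A.3)
lhs_active = kkt_lhs(x, y, lam_active)
lhs_zero = kkt_lhs(x, y, lam_zero)
```

In the first case the left-hand side of (A.3) equals $r\|x\|_n$ times the sign of the solution, and in the second case its absolute value does not exceed $r\|x\|_n$.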

Proof of Lemma 2. Inequality (2.11) follows immediately from the 
definition of the Dantzig selector, cf. [7]. To prove (2.12) consider the event 

$\mathcal{B} = \left\{\left|\frac{1}{n} D^{-1/2} X^T W\right|_\infty \le r\right\} = \bigcap_{j=1}^M \{|V_j| \le r_{n,j}\}.$

Analogously to (A.1), $P\{\mathcal{B}^c\} \le M^{1-A^2/2}$. On the other hand, $y = \mathbf{f} + W$ 
and using the definition of the Dantzig selector it is easy to see that (2.12) 
is satisfied on $\mathcal{B}$. 



Proof of Lemma 3. Consider a partition of $J_0^c$ into subsets of size m, with 
the last subset of size $\le m$: $J_0^c = \bigcup_{k=1}^K J_k$ where $K \ge 1$, $|J_k| = m$ for 
$k = 1, \dots, K-1$ and $|J_K| \le m$, such that $J_k$ is the set of indices 
corresponding to the m largest in absolute value coordinates of $\Delta$ outside 
$\bigcup_{j=0}^{k-1} J_j$ (for $k < K$) and $J_K$ is the remaining subset. We have 

(A.6) $|P_{01}X\Delta|_2 \ge |P_{01}X\Delta_{J_{01}}|_2 - \left|\sum_{k=2}^K P_{01}X\Delta_{J_k}\right|_2 = |X\Delta_{J_{01}}|_2 - \left|\sum_{k=2}^K P_{01}X\Delta_{J_k}\right|_2 \ge |X\Delta_{J_{01}}|_2 - \sum_{k=2}^K |P_{01}X\Delta_{J_k}|_2.$

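We will prove first part (ii) of the lemma. Since for $k \ge 1$ the vector 
$\Delta_{J_k}$ has only m non-zero components we obtain 

(A.7) $\frac{1}{\sqrt{n}}|P_{01}X\Delta_{J_k}|_2 \le \frac{1}{\sqrt{n}}|X\Delta_{J_k}|_2 \le \sqrt{\phi_{\max}(m)}\,|\Delta_{J_k}|_2.$

Next, as in [7], we observe that $|\Delta_{J_{k+1}}|_2 \le |\Delta_{J_k}|_1/\sqrt{m}$, $k = 1, \dots, K-1$, 
and therefore 

(A.8) $\sum_{k=2}^K |\Delta_{J_k}|_2 \le \frac{|\Delta_{J_0^c}|_1}{\sqrt{m}} \le \frac{c_0|\Delta_{J_0}|_1}{\sqrt{m}} \le c_0\sqrt{\frac{s}{m}}\,|\Delta_{J_0}|_2 \le c_0\sqrt{\frac{s}{m}}\,|\Delta_{J_{01}}|_2$

where we used (3.3). From (A.6) - (A.8) we find 

(A.9) $\frac{1}{\sqrt{n}}|P_{01}X\Delta|_2 \ge \frac{1}{\sqrt{n}}|X\Delta_{J_{01}}|_2 - c_0\sqrt{\phi_{\max}(m)}\sqrt{\frac{s}{m}}\,|\Delta_{J_{01}}|_2 \ge \left(\sqrt{\phi_{\min}(s+m)} - c_0\sqrt{\phi_{\max}(m)}\sqrt{\frac{s}{m}}\right)|\Delta_{J_{01}}|_2$

which proves part (ii) of the lemma. 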

The proof of part (i) is analogous. The only difference is that we replace 
in the above argument m by s and instead of (A.7) we use the following 
bound (cf. [7]): 

(A.10) $\frac{1}{\sqrt{n}}|P_{01}X\Delta_{J_k}|_2 \le \frac{\theta_{s,2s}}{\sqrt{\phi_{\min}(2s)}}\,|\Delta_{J_k}|_2.$
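
The block inequality $|\Delta_{J_{k+1}}|_2 \le |\Delta_{J_k}|_1/\sqrt{m}$ used in the proof holds because every coordinate in block $J_{k+1}$ is no larger in absolute value than the average absolute value over block $J_k$. The sketch below checks it numerically; it is an illustration only, the function names are hypothetical, and the input is assumed to contain the coordinates of $\Delta$ outside $J_0$.

```python
import math

def sorted_blocks(delta_outside, m):
    """Split the coordinates outside J_0 into blocks of size m by decreasing magnitude."""
    mags = sorted((abs(v) for v in delta_outside), reverse=True)
    return [mags[i:i + m] for i in range(0, len(mags), m)]

def check_block_inequality(delta_outside, m):
    """Check |Delta_{J_{k+1}}|_2 <= |Delta_{J_k}|_1 / sqrt(m) for all consecutive blocks."""
    blocks = sorted_blocks(delta_outside, m)
    for cur, nxt in zip(blocks, blocks[1:]):
        l2_next = math.sqrt(sum(v * v for v in nxt))
        l1_cur = sum(cur)
        # Small tolerance guards against floating-point rounding.
        if l2_next > l1_cur / math.sqrt(m) + 1e-12:
            return False
    return True

# Hypothetical coordinates of Delta outside J_0.
delta_outside = [3.0, -1.0, 0.5, 2.0, -0.2, 0.1, 0.05]
holds = check_block_inequality(delta_outside, m=2)
```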

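Proof of Theorem 6.1. Set $\Delta = \hat\beta_D - \beta^*$ and $J_0 = J(\beta^*)$. Using Lemma 
2 with $\lambda = \beta^*$ we get that on the event $\mathcal{B}$ (i.e., with probability at least 
$1 - M^{1-A^2/2}$): (i) $\frac{1}{n}|X^TX\Delta|_\infty \le 2r$, and (ii) inequality (3.3) holds with 
$c_0 = 1$. Therefore, on $\mathcal{B}$ we have 

(A.11) $\frac{1}{n}|X\Delta|_2^2 = \frac{1}{n}\Delta^T X^T X \Delta \le \frac{1}{n}|X^TX\Delta|_\infty |\Delta|_1 \le 2r\left(|\Delta_{J_0}|_1 + |\Delta_{J_0^c}|_1\right) \le 2(1+c_0)\,r\,|\Delta_{J_0}|_1 \le 2(1+c_0)\,r\sqrt{s}\,|\Delta_{J_0}|_2 = 4r\sqrt{s}\,|\Delta_{J_0}|_2$

since $c_0 = 1$. From Assumption RE(s,1) we get that 

$\frac{1}{n}|X\Delta|_2^2 \ge \kappa^2|\Delta_{J_0}|_2^2$

where $\kappa = \kappa(s,1)$. This and (A.11) yield that, on $\mathcal{B}$, 

(A.12) $\frac{1}{n}|X\Delta|_2^2 \le 16r^2 s/\kappa^2, \quad |\Delta_{J_0}|_2 \le 4r\sqrt{s}/\kappa^2.$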

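The first inequality in (A.12) implies (6.5). Next, (6.4) is straightforward 
in view of the second inequality in (A.12) and of the following relations 
(with $c_0 = 1$): 

(A.13) $|\Delta|_1 = |\Delta_{J_0}|_1 + |\Delta_{J_0^c}|_1 \le (1+c_0)|\Delta_{J_0}|_1 \le (1+c_0)\sqrt{s}\,|\Delta_{J_0}|_2$

that hold on $\mathcal{B}$. It remains to prove (6.6). It is easy to see that the kth 
largest in absolute value element of $\Delta_{J_0^c}$ satisfies 
$|\Delta_{J_0^c}|_{(k)} \le |\Delta_{J_0^c}|_1/k$. Thus 

(A.14) $|\Delta_{J_{01}^c}|_2^2 \le |\Delta_{J_0^c}|_1^2 \sum_{k \ge m+1} \frac{1}{k^2} \le \frac{1}{m}|\Delta_{J_0^c}|_1^2$

and since (3.3) holds on $\mathcal{B}$ (with $c_0 = 1$) we find 

$|\Delta_{J_{01}^c}|_2 \le \frac{c_0|\Delta_{J_0}|_1}{\sqrt{m}} \le c_0|\Delta_{J_0}|_2\sqrt{\frac{s}{m}} \le c_0|\Delta_{J_{01}}|_2\sqrt{\frac{s}{m}}.$

Therefore, on $\mathcal{B}$, 

(A.15) $|\Delta|_2 \le \left(1 + c_0\sqrt{\frac{s}{m}}\right)|\Delta_{J_{01}}|_2.$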
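
The two elementary facts behind (A.14), namely that the kth largest absolute coordinate of a vector is at most its $\ell_1$ norm divided by k, and that $\sum_{k \ge m+1} k^{-2} \le 1/m$, can be checked numerically. The sketch below is an illustration only; the function names are hypothetical, and the infinite series is replaced by a long partial sum, which can only underestimate it.

```python
def kth_largest_bound_holds(v):
    """Check |v|_(k) <= |v|_1 / k for every k, where |v|_(k) is the kth largest |v_j|."""
    mags = sorted((abs(t) for t in v), reverse=True)
    l1 = sum(mags)
    return all(mags[k - 1] <= l1 / k + 1e-12 for k in range(1, len(mags) + 1))

def tail_sum_inverse_squares(m, terms=100000):
    """Partial sum of 1/k^2 over k = m+1, ..., m+terms (a lower bound on the full tail)."""
    return sum(1.0 / (k * k) for k in range(m + 1, m + 1 + terms))

ok_order = kth_largest_bound_holds([3.0, -1.0, 0.5, 2.0, -0.2])
tail_m4 = tail_sum_inverse_squares(4)
```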

On the other hand, it follows from (A.11) that 

$\frac{1}{n}|X\Delta|_2^2 \le 4r\sqrt{s}\,|\Delta_{J_{01}}|_2.$

Combining this inequality with Assumption RE(s,m,1) we obtain that, on $\mathcal{B}$, 

$|\Delta_{J_{01}}|_2 \le 4r\sqrt{s}/\kappa^2.$

Recalling that $c_0 = 1$ and applying the last inequality together with (A.15) 
we get 

(A.16) $|\Delta|_2^2 \le 16\left(1 + c_0\sqrt{\frac{s}{m}}\right)^2 \left(r\sqrt{s}/\kappa^2\right)^2.$

It remains to note that (6.6) is a direct consequence of (6.4) and (A.16). 
This follows from the fact that inequalities $\sum_{j=1}^M a_j \le b_1$ and 
$\sum_{j=1}^M a_j^2 \le b_2$ with $a_j \ge 0$ imply 

$\sum_{j=1}^M a_j^p = \sum_{j=1}^M a_j^{2-p} a_j^{2p-2} \le \left(\sum_{j=1}^M a_j\right)^{2-p} \left(\sum_{j=1}^M a_j^2\right)^{p-1} \le b_1^{2-p} b_2^{p-1}, \quad \forall\ 1 < p \le 2.$
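
The interpolation step above, which bounds $\sum_j a_j^p$ by $(\sum_j a_j)^{2-p}(\sum_j a_j^2)^{p-1}$, is an application of Hölder's inequality and is easy to check numerically. The sketch below is an illustration only, with hypothetical function names and an arbitrary non-negative vector.

```python
def lp_power(a, p):
    """Sum of a_j^p for non-negative a_j."""
    return sum(v ** p for v in a)

def interpolation_bound(a, p):
    """Upper bound (sum a_j)^(2-p) * (sum a_j^2)^(p-1) on sum a_j^p, for 1 <= p <= 2."""
    b1 = sum(a)
    b2 = sum(v * v for v in a)
    return b1 ** (2.0 - p) * b2 ** (p - 1.0)

# Arbitrary non-negative vector (hypothetical data).
a = [0.5, 0.0, 1.5, 0.25, 2.0]
checks = [(p, lp_power(a, p), interpolation_bound(a, p))
          for p in (1.0, 1.25, 1.5, 1.75, 2.0)]
```

At the endpoints p = 1 and p = 2 the two sides coincide, and for intermediate p the bound interpolates between the $\ell_1$ and squared $\ell_2$ norms.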



Proof of Theorem 6.2. Set $\Delta = \hat\beta_L - \beta^*$ and $J_0 = J(\beta^*)$. Using (2.7) 
where we put $\lambda = \beta^*$, $r_{n,j} \equiv r$ and $\|f_\lambda - f\|_n = 0$ we get that, on the 
event $\mathcal{A}$, 

(A.17) $\frac{1}{n}|X\Delta|_2^2 \le 4r\sqrt{s}\,|\Delta_{J_0}|_2$

and (3.3) holds with $c_0 = 3$ on the same event. Thus, by Assumption RE(s,3) 
and the last inequality we obtain that, on $\mathcal{A}$, 

(A.18) $\frac{1}{n}|X\Delta|_2^2 \le 16r^2 s/\kappa^2, \quad |\Delta_{J_0}|_2 \le 4r\sqrt{s}/\kappa^2$

where $\kappa = \kappa(s,3)$. The first inequality here coincides with (6.8). Next, (6.9) 
follows immediately from (2.9) and (6.8). To show (6.7) it suffices to note 
that on the event $\mathcal{A}$ the relations (A.13) hold with $c_0 = 3$, to apply the 
second inequality in (A.18) and to use (A.1). 

Finally, the proof of (6.10) follows exactly the same lines as that of (6.6): 
the only difference is that one should set $c_0 = 3$ in (A.15), (A.16), as well as 
in the display preceding (A.15). 